The Effect of COVID-19 on Global Temperature and Greenhouse Gas Atmospheric Concentration: an EDA Approach
Last modified: 07 Dec 2022
1 Background
Coronavirus disease 2019 (COVID-19), caused by the SARS-CoV-2 virus, was declared a public health emergency of international concern on January 30, 2020, and a pandemic on March 11, 2020, by the World Health Organization (WHO). The declaration of the pandemic led to widespread lockdowns, reduced travel, and decreased energy usage. To date, the WHO still considers COVID-19 an ongoing pandemic, with over 640 million cases reported worldwide.
The effects of COVID-19 have permeated all aspects of society, including our contributions to climate change. For example, during the core 2020 lockdown period, weekly global CO2 emissions fell relative to 2019, amounting to a 6.4% decrease in CO2 emissions compared to the year prior [1]. A 2020 report indicated that half of the world's population had reduced their travel activity by at least 50% by April, early in the pandemic, which coincided with up to a 30% decrease in nitrogen oxide (NO/NO2) emissions for that month [2]. Nitrogen oxide emissions in East China dropped by an average of about 50% during the initial lockdown period (January 23 - February 9) compared with the 22 days prior [3]. While recent reports have documented the resurgence of greenhouse gas emissions following the relaxation of restrictions, it is clear that the COVID-19 lockdowns had a positive, if temporary, impact on climate change.
Given the many scientific and news reports on the effect of lockdowns on the environment (including this popular report covering the water in the canals of Venice, Italy running clear during lockdown), we were interested in conducting our own investigation into changes in environmental markers during COVID-19, with the goal of answering the question: is there a noticeable change in greenhouse gas concentrations and global temperature during the COVID-19 period as compared to other years? For this purpose, we defined the COVID-19 window as December 2019 - July 2021: December 2019 marks the first reported case of COVID-19, and July 2021 is approximately when half the U.S. population had received at least one vaccine dose.
To answer this question, we acquired monthly average global greenhouse gas concentration data from NOAA for the following gases:
- Methane (CH4), emitted prominently from landfills, mining, and energy usage facilities
- Nitrous oxide (N2O), a byproduct of industrial activities
- Sulfur hexafluoride (SF6), used widely in the electricity industry for insulation
Additionally, monthly average global temperature anomaly data were acquired from the NASA GISS Surface Temperature Analysis (GISTEMP), consisting of 3 total datasets, each covering the same period of years:
- Temperature anomalies from AIRS (Atmospheric InfraRed Sounder) v6
- Temperature anomalies from AIRS v7
- Temperature anomalies from GHCN (Global Historical Climatology Network) v4
Relationships between years were analyzed using clustering, heatmaps, and moving-average plots; trends were observed and predicted with simple linear models. See Section 2 to learn more about the data and how it was wrangled and assessed prior to visualization/EDA.
2 About the Data
2.1 Downloading and Filtering
Data were downloaded from NOAA as 3 fixed-width .txt files, one per gas. Data from NASA were contained in a single .csv file consisting of the 3 separate datasets (anomalies recorded by each of the above instruments/versions), delimited by separator strings.
NOAA data contain information encoded by the following fields (~1980 - present): 1) year, 2) month, 3) date in decimal-year form, 4) average monthly gas concentration as a mole fraction (ppm, ppb, or ppt, depending on the gas), 5) average uncertainty (a standard error measurement performed by NOAA), 6) trend in mole fraction, and 7) standard error of the trend.
NASA data encode monthly temperature anomalies in degrees C: for each month of each year (2002 - present), the difference between that month's average temperature and the average temperature of that month over the period 2006 - 2017. Positive anomalies indicate the temperature was warmer than the long-term baseline. Temperature anomalies are favored over absolute temperatures for comparative analysis because they are normalized to a common frame of reference, and can therefore more accurately capture temperature variation across long durations of time.
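As a minimal numeric illustration of the anomaly definition (made-up values, not taken from the GISTEMP data): an anomaly is simply the observed monthly mean minus the baseline mean for that calendar month.

```r
# Hypothetical numbers for illustration only
baseline_jan <- 14.2   # assumed 2006-2017 average January temperature, deg C
observed_jan <- 14.5   # assumed January mean in some later year, deg C
anomaly <- observed_jan - baseline_jan   # positive => warmer than baseline
round(anomaly, 1)   # ~0.3 deg C above the baseline
```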
Since both data files contained extraneous textual information, the terminal-based text stream parsing program awk was used to filter the data as follows:
# NOAA data
awk '{if ($1 > 2002) print $0}' "${name}.txt" > "${name}-filtered.txt"
# NASA data
awk '/Global.*/{n++}{print > "nasa_gistemp_" n ".csv"}' ${FILENAME}
The former command keeps only rows whose year exceeds 2002 (i.e., 2003 onward); this cutoff was chosen because NASA temperature data only go back to 2002, and consistency among datasets was desired for the final output. The latter command splits the NASA GISTEMP parent .csv into 3 R-friendly .csv files.
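To illustrate the splitting behavior of the second command, here is a toy reproduction on a made-up two-section file (file names and contents are invented; the real input's sections each begin with a line containing "Global"):

```shell
# Build a toy two-section file, then split it the same way as the NASA file
printf 'Global section A\n2002,0.1\nGlobal section B\n2002,0.2\n' > toy.csv
awk '/Global.*/{n++}{print > "nasa_gistemp_" n ".csv"}' toy.csv
# Each "Global" header increments n, so each section lands in its own file
head -n 1 nasa_gistemp_1.csv
head -n 1 nasa_gistemp_2.csv
```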
2.2 Data Wrangling
Our interest was in comparing the properties of the data in the COVID window (12/2019 - 07/2021) against previous windows of comparable length. Since the long-term effect of COVID-19 on climate change is considered insignificant, we hoped that by choosing "control" windows of the same size, we could compare the short-term effects of COVID against similar time periods. Because we chose to work with granular monthly time points, we did not want to smooth away the variation of previous years in, say, a 5-year average, as this would change the distributions of the data being compared. This scheme was used for some, but not all, of the exploratory analysis. A list of lists of data frames was therefore created, in which each sub-list consists of the data frames for one dataset (e.g., one sub-list for methane data, one for nitrous oxide, etc.), with each dataset broken up into 20-month windows:
- December 2003 - July 2005
- December 2005 - July 2007
- …
- December 2019 - July 2021
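As a quick sanity check on this windowing scheme: each window runs from December of a start year through July two years later (20 months), with start years stepping by 2 from 2003 to 2019. A small sketch:

```r
# Enumerate the window labels: Dec of year X through Jul of year X + 2
starts <- seq(2003, 2019, by = 2)
windows <- paste0("Dec ", starts, " - Jul ", starts + 2)
length(windows)    # 9 windows in total
tail(windows, 1)   # "Dec 2019 - Jul 2021", the COVID window
# December through July of year X + 2 spans 1 + 12 + 7 = 20 months
```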
Below is the structure of the original data. Note that null values were encoded as "-9.99" for NOAA data and as asterisks (*****) for NASA data; these were converted to NA prior to analysis. Additionally, the original NASA data were in "wide" format; to be able to create Date objects for time-series plots, the data were pivoted to long format (shown below).
Code
##### Code from setup.r file
# Set up top-level project directory
# Note that this assumes one of the following:
# 1) You have cloned the git repo with git clone
# 2) You have a file system as defined in the GitHub repository
# I.e., similar to ~/Project/Datasets and ~/Project/Scripts
# Code from https://stackoverflow.com/questions/47044068/
getCurrentFileLocation <- function() {
this_file <- commandArgs() %>%
tibble::enframe(name = NULL) %>%
tidyr::separate(col = value,
into = c("key", "value"),
sep = "=",
fill = "right") %>%
dplyr::filter(key == "--file") %>%
dplyr::pull(value)
if (length(this_file) == 0) {
this_file <- rstudioapi::getSourceEditorContext()$path
}
return(dirname(this_file))
}
# Alternative is to use here() (from "here" package)
# Walks up dir hierarchy starting from wd during loading until it finds a dir satisfying either:
# 1) Contains file matching [.]Rproj$
# 2) Contains a .git directory
# Disadvantage is that it only works if you're in a directory *within* root
# Get directory of current file, which should be in Scripts sub-dir
current.dir <- file.path(getCurrentFileLocation())
setwd(current.dir)
setwd("../") # The directory immediately above should be project root
rm(current.dir)
# Setup project, scripts, and data directories
project.dir <- getwd()
scripts.dir <- "Scripts"
data.dir <- "Datasets"
# Set directory to datasets
setwd(file.path(project.dir, data.dir))
# Source import and wrangle files
source(file.path(project.dir, scripts.dir, "importdata.r"))
source(file.path(project.dir, scripts.dir, "gas_temp_wrangle.r"))
# Read in data
gas.data <- read_gas_data()
temp.data <- read_temp_data()
Code
###### Code from the .r files called by setup.r script
#-----Importing NOAA greenhouse gas data-----#
# Used awk (see filtergas.sh) to pre-filter NOAA greenhouse gas data for 2003 - present
read_gas_data <- function() {
# Get list of gases of interest for which we have data
gas.list <- c("ch4", "n2o", "sf6")
gas.files <- lapply(gas.list, function(x) list.files(path = ".", pattern = paste0(x, ".*filtered\\.txt")))
# Read files into df list
gas.data <- lapply(gas.files, fread)
names(gas.data) <- gas.list
# CH4 is in ppm, N2O is in ppb, SF6 is in ppt - rename all colnames just to ppb first
gas.colnames <- c("Year",
"Month",
"Year_Day_decimal",
"Average_ppb",
"Uncertainty_avg",
"Trend",
"Uncertainty_trend")
gas.data <- lapply(gas.data, function(x) setNames(x, gas.colnames))
# Now rename colnames for ch4 and sf6
ppb.col <- which(gas.colnames == "Average_ppb")
colnames(gas.data$ch4)[ppb.col] <- "Average_ppm"
colnames(gas.data$sf6)[ppb.col] <- "Average_ppt"
return(gas.data)
}
#-----Importing NASA GISTEMP data-----#
# Used awk (see parsetemp.sh) to split the files into 3 sub-csv files
read_temp_data <- function() {
temp.files <- list.files(path = ".", pattern = "nasa_gistemp_[123][.]csv")
# Read files into df list - skip first line, which is string description
temp.data <- lapply(temp.files, function(x) read.csv(x, skip = 1, header = TRUE))
names(temp.data) <- c("airsv6", "airsv7", "ghcnv4")
return(temp.data)
}
Code
# First 4 rows of each data frame
for (list.obj in list(gas.data, temp.data)) {
for (df in list.obj) {
print(head(df, n = 4L))
}
}
## Year Month Year_Day_decimal Average_ppm Uncertainty_avg Trend
## 1: 2003 1 2003.042 1777.56 0.96 1774.74
## 2: 2003 2 2003.125 1774.83 1.04 1775.27
## 3: 2003 3 2003.208 1774.82 0.80 1775.81
## 4: 2003 4 2003.292 1775.74 1.26 1776.38
## Uncertainty_trend
## 1: 0.60
## 2: 0.59
## 3: 0.58
## 4: 0.56
## Year Month Year_Day_decimal Average_ppb Uncertainty_avg Trend
## 1: 2003 1 2003.042 317.29 0.12 317.24
## 2: 2003 2 2003.125 317.34 0.12 317.30
## 3: 2003 3 2003.208 317.40 0.12 317.36
## 4: 2003 4 2003.292 317.47 0.12 317.43
## Uncertainty_trend
## 1: 0.12
## 2: 0.12
## 3: 0.12
## 4: 0.12
## Year Month Year_Day_decimal Average_ppt Uncertainty_avg Trend
## 1: 2003 1 2003.042 5.11 0.01 5.11
## 2: 2003 2 2003.125 5.12 0.01 5.13
## 3: 2003 3 2003.208 5.15 0.01 5.15
## 4: 2003 4 2003.292 5.19 0.00 5.17
## Uncertainty_trend
## 1: 0
## 2: 0
## 3: 0
## 4: 0
## Year Jan Feb Mar Apr May Jun Jul Aug Sep
## 1 2002 ******* ******* ******* ******* ******* ******* ******* ******* -0.094
## 2 2003 0.035 -0.069 -0.099 -0.140 -0.164 -0.203 -0.191 -0.147 -0.091
## 3 2004 0.059 0.107 -0.080 -0.173 -0.450 -0.229 -0.344 -0.326 -0.252
## 4 2005 0.041 -0.098 -0.009 -0.013 -0.006 0.012 -0.025 -0.111 -0.069
## Oct Nov Dec J.D D.N DJF MAM JJA SON
## 1 -0.203 -0.273 -0.243 ******* ******* ******* ******* ******* -0.190
## 2 0.022 -0.578 0.023 -0.134 -0.156 -0.093 -0.134 -0.181 -0.216
## 3 -0.236 -0.107 -0.140 -0.181 -0.167 0.063 -0.234 -0.299 -0.199
## 4 0.021 -0.022 0.026 -0.021 -0.035 -0.066 -0.009 -0.041 -0.024
## Year Jan Feb Mar Apr May Jun Jul Aug Sep
## 1 2002 ******* ******* ******* ******* ******* ******* ******* ******* 0.130
## 2 2003 0.133 -0.023 -0.014 0.029 0.085 0.020 0.017 0.027 0.074
## 3 2004 0.126 0.148 0.033 -0.017 -0.307 -0.075 -0.254 -0.261 -0.100
## 4 2005 0.138 0.048 0.127 0.114 0.110 0.108 0.029 -0.052 0.017
## Oct Nov Dec J.D D.N DJF MAM JJA SON
## 1 0.020 -0.021 -0.077 ******* ******* ******* ******* ******* 0.043
## 2 0.186 -0.361 0.193 0.031 0.008 0.011 0.033 0.022 -0.034
## 3 -0.072 0.041 -0.054 -0.066 -0.045 0.156 -0.097 -0.197 -0.044
## 4 0.103 0.048 0.081 0.073 0.061 0.044 0.117 0.028 0.056
## Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct
## 1 2002 0.051 0.103 0.090 -0.152 -0.068 -0.141 -0.051 -0.182 -0.092 -0.213
## 2 2003 0.027 -0.104 -0.199 -0.184 -0.105 -0.191 -0.083 -0.068 -0.100 -0.026
## 3 2004 -0.139 0.038 -0.162 -0.125 -0.343 -0.231 -0.400 -0.257 -0.223 -0.148
## 4 2005 0.018 -0.088 -0.053 -0.065 -0.082 -0.026 -0.046 -0.110 -0.012 -0.003
## Nov Dec J.D D.N DJF MAM JJA SON
## 1 -0.185 -0.242 -0.090 -0.080 0.011 -0.043 -0.125 -0.163
## 2 -0.241 0.069 -0.100 -0.126 -0.106 -0.163 -0.114 -0.123
## 3 -0.049 -0.174 -0.184 -0.164 -0.011 -0.210 -0.296 -0.140
## 4 -0.038 0.001 -0.042 -0.057 -0.081 -0.067 -0.061 -0.018
Structure of the wrangled data, with lists of lists:
Code
# Moving window function, which will subset a df into a list of dfs by desired month cycle
df_window_subset <- function(df, init.year.wind = 2003, wind.size = 2, final.init.year = 2019) {
moving.dfsub.list <- list()
while (init.year.wind <= final.init.year) {
end.year.wind <- init.year.wind + wind.size
# Filter for only years in the current window
# Then filter for only December, year X to Jul, year X + 2
moving.df <- df %>%
filter(Year >= init.year.wind & Year <= end.year.wind) %>%
filter(!((Year == init.year.wind & Month < 12) | (Year == end.year.wind & Month > 7)))
moving.dfsub.list[[paste(init.year.wind, end.year.wind, sep = "-")]] <- as.data.frame(moving.df)
init.year.wind <- init.year.wind + wind.size
}
return(moving.dfsub.list)
}
# Adds a column of Date objects to a df from cols named Year and Month. Assumes day is constant (1 by default)
add_date_col <- function(df, day = 1) {
df <- df %>% mutate(Date = as.Date(paste(Year, Month, day, sep = "-"),
format = "%Y-%m-%d"))
return(df)
}
tempdata.convert.na <- function(data.list) {
# For each data frame of temp.data, take each column and replace ***** with NA
data.list.na <- lapply(data.list,
function(df) as.data.frame(lapply(df, function(x) as.numeric(gsub(x, pattern = "\\*.*",
replacement = NA)))))
return(data.list.na)
}
wrangle_tempdata <- function(temp.data) {
# Need to convert asterisk values to NA in order to convert the data to long format
# pivot_longer() cannot combine columns that are of diff atomic types
# Use defined tempdata.convert.na above to accomplish this
temp.data.na <- tempdata.convert.na(temp.data)
# Convert data to long format to allow for date-based time series plot
temp.data.long <- lapply(temp.data.na, function(df) df %>%
pivot_longer(cols = "Jan":"Dec",
names_to = "Month",
values_to = "Temp_diff"))
# Converting month abbv to numeric
# Use data.table::set() to modify col j of the df in place for use in lapply - avoids requiring for loop
# Sets the value of col j to that given by the value argument
lapply(temp.data.long, function(df) set(df, j = "Month", value = match(df$"Month", month.abb)))
# Add Date object column for time series analysis
temp.data.date <- lapply(temp.data.long, add_date_col)
# For each data frame in temp.data.long, create the windowed subset df
# This produces for each df, a list of sub-dfs
# Sub-dfs can be indexed by, e.g., list$airsv6$"2003-2005"
windowed.df.list <- lapply(temp.data.date, df_window_subset)
return(windowed.df.list)
}
gasdata.convert.na <- function(data.list) {
# Use replace_with_na() from the naniar package to replace values of -9.99 with NA
# See https://cran.r-project.org/web/packages/naniar/vignettes/replace-with-na.html
if (!require(naniar)) {
install.packages("naniar", quietly = TRUE)
library(naniar)
}
data.list.na <- lapply(data.list, function(df) df %>%
replace_with_na(replace = list(Uncertainty_avg = -9.99,
Uncertainty_trend = -9.99)))
return(data.list.na)
}
wrangle_gasdata <- function(gas.data) {
# Replace -9.99 values with NA using gasdata.convert.na() function above
gas.data.na <- gasdata.convert.na(gas.data)
gas.data.date <- lapply(gas.data.na, add_date_col)
gas.data.windowed.list <- lapply(gas.data.date, df_window_subset)
return(gas.data.windowed.list)
}
gas.data.windowed.dfs <- wrangle_gasdata(gas.data)
temp.data.windowed.dfs <- wrangle_tempdata(temp.data)
Code
# Only a few data frames are shown here for convenience
head(gas.data.windowed.dfs$ch4$"2003-2005")
tail(gas.data.windowed.dfs$ch4$"2003-2005")
head(temp.data.windowed.dfs$airsv7$"2019-2021")
tail(temp.data.windowed.dfs$airsv7$"2019-2021")
## Year Month Year_Day_decimal Average_ppm Uncertainty_avg Trend
## 1 2003 12 2003.958 1783.56 1.02 1779.31
## 2 2004 1 2004.042 1780.88 0.99 1779.24
## 3 2004 2 2004.125 1779.97 0.85 1779.06
## 4 2004 3 2004.208 1780.96 0.83 1778.79
## 5 2004 4 2004.292 1780.72 0.74 1778.41
## 6 2004 5 2004.375 1777.55 0.98 1777.95
## Uncertainty_trend Date
## 1 0.50 2003-12-01
## 2 0.49 2004-01-01
## 3 0.49 2004-02-01
## 4 0.49 2004-03-01
## 5 0.49 2004-04-01
## 6 0.49 2004-05-01
## Year Month Year_Day_decimal Average_ppm Uncertainty_avg Trend
## 15 2005 2 2005.125 1775.65 0.99 1774.14
## 16 2005 3 2005.208 1775.36 0.94 1774.05
## 17 2005 4 2005.292 1774.61 0.85 1774.03
## 18 2005 5 2005.375 1772.27 0.67 1774.07
## 19 2005 6 2005.458 1768.57 0.97 1774.16
## 20 2005 7 2005.542 1766.29 0.87 1774.27
## Uncertainty_trend Date
## 15 0.53 2005-02-01
## 16 0.53 2005-03-01
## 17 0.53 2005-04-01
## 18 0.54 2005-05-01
## 19 0.55 2005-06-01
## 20 0.57 2005-07-01
## Year J.D D.N DJF MAM JJA SON Month Temp_diff Date
## 1 2019 0.259 0.239 0.181 0.272 0.240 0.264 12 0.352 2019-12-01
## 2 2020 0.266 0.292 0.356 0.306 0.249 0.257 1 0.325 2020-01-01
## 3 2020 0.266 0.292 0.356 0.306 0.249 0.257 2 0.390 2020-02-01
## 4 2020 0.266 0.292 0.356 0.306 0.249 0.257 3 0.305 2020-03-01
## 5 2020 0.266 0.292 0.356 0.306 0.249 0.257 4 0.334 2020-04-01
## 6 2020 0.266 0.292 0.356 0.306 0.249 0.257 5 0.280 2020-05-01
## Year J.D D.N DJF MAM JJA SON Month Temp_diff Date
## 15 2021 0.117 0.106 0.073 0.089 0.15 0.113 2 0.070 2021-02-01
## 16 2021 0.117 0.106 0.073 0.089 0.15 0.113 3 0.093 2021-03-01
## 17 2021 0.117 0.106 0.073 0.089 0.15 0.113 4 0.078 2021-04-01
## 18 2021 0.117 0.106 0.073 0.089 0.15 0.113 5 0.096 2021-05-01
## 19 2021 0.117 0.106 0.073 0.089 0.15 0.113 6 0.118 2021-06-01
## 20 2021 0.117 0.106 0.073 0.089 0.15 0.113 7 0.204 2021-07-01
2.3 Quality Assessment
Below are some summarized results of the quality assessment performed on the data prior to downstream analysis.
Code
# Column completeness
col_complete <- function(dataframe){
if(!is.data.frame(dataframe)){
stop("the object is not a dataframe")
}
missing_count <- 0
for(j in 1:ncol(dataframe)){
for(i in 1:nrow(dataframe)){
if(is.na(dataframe[i, j])){missing_count <- missing_count+1}
}
}
complete_ratio <- 1 - missing_count/(nrow(dataframe)*ncol(dataframe))
return(complete_ratio)
}
library(visdat)
library(naniar)
library(tidyverse)
# Analyzing for completeness
col_complete(temp.data.converted$airsv6)
col_complete(temp.data.converted$airsv7)
col_complete(temp.data.converted$ghcnv4)
# Analyzing for completeness
col_complete(as.data.frame(gas.data.converted$ch4))
col_complete(as.data.frame(gas.data.converted$n2o))
col_complete(as.data.frame(gas.data.converted$sf6))
## [1] 0.9573935
## [1] 0.9573935
## [1] 0.9899749
## [1] 0.9927052
## [1] 0.9927052
## [1] 0.9927052
Code
# Representative plots for analyzing missingness and outliers
vis_miss(temp.data.converted$airsv6)
Figure 2.1: Various plots indicating data quality of acquired climate change data
Code
vis_miss(as.data.frame(gas.data.converted$ch4))
Figure 2.2: Various plots indicating data quality of acquired climate change data
Code
airsv6 <- subset(temp.data.converted$airsv6, select = -Year)
airsv7 <- subset(temp.data.converted$airsv7, select = -Year)
airsv6 %>%
select(1:12) %>%
boxplot()
Figure 2.3: Various plots indicating data quality of acquired climate change data
Code
airsv7 %>%
select(1:12) %>%
boxplot()
Figure 2.4: Various plots indicating data quality of acquired climate change data
The column-completeness ratios for the "airsv6" and "airsv7" datasets are identical (0.9573935), while the ratio for the "ghcnv4" dataset is slightly higher (0.9899749). In other words, the datasets are nearly complete, and the few missing values are not of significant concern. All three temperature datasets report values at consistent precision. Most missing values occur in 2002 (for AIRS v6 and AIRS v7), suggesting that data collection for those instruments had not yet started.
In the “airsv6” dataset, the months of January, May, June, & November each have one outlier. In the “airsv7” dataset, January, March, & June each have one outlier while November has five outliers.
For the gas data, the column-completeness ratios for all three datasets (ch4, n2o, sf6) are each 0.9927052; these few missing values should not meaningfully affect the analysis. The missing values all fall in the final months of 2022, simply indicating that the datasets had not yet been updated, as data collection and aggregation are likely still ongoing. Each column of the three datasets is also reported at consistent precision. In terms of outliers, the "ch4" dataset has the most (not shown for simplicity).
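As a cross-check on these numbers, the `col_complete()` loop used above is equivalent to a one-line vectorized computation (shown here on a toy data frame, not the project data):

```r
# Fraction of non-NA cells in a data frame, vectorized
complete_ratio <- function(df) mean(!is.na(df))
toy <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))
complete_ratio(toy)   # 4 of 6 cells present, i.e. ~0.667
```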
3 Exploratory Data Analysis - Plots and Commentary
3.1 Gas atmospheric concentration predictions
Code
# CH4
# Post-editorial note: originally, we converted everything to ppb
# We later decided to undo this, but some of our plots still use ppb
# Therefore, I am manually changing the ppm and ppt columns to ppb
colnames(gas.data[[1]])[which(colnames(gas.data[[1]]) == "Average_ppm")] <- "Average_ppb"
gas.data[[1]]$Average_ppb <- gas.data[[1]]$Average_ppb * 1000
ch4 <- as.data.frame(gas.data[1])
# colnames(ch4)[which(colnames(ch4) == "ch4.Average_ppm")] <- "Average_ppb"
# ch4$Average_ppb <- ch4$Average_ppb * 1000
names(ch4) <- gsub('ch4.', "", names(ch4))
# make ch4 data frames (one per 20-month window)
ch4 <- df_window_subset(ch4)
# add a Date column to each windowed data frame
ch4 <- lapply(ch4, add_date_col)
# combine into one df, labeling each window with a readable Period string
period_labels <- c("Dec '03 - July '05", "Dec '05 - July '07", "Dec '07 - July '09",
                   "Dec '09 - July '11", "Dec '11 - July '13", "Dec '13 - July '15",
                   "Dec '15 - July '17", "Dec '17 - July '19", "Dec '19 - July '21")
ch4_df <- do.call(rbind, Map(function(df, lab) {
  df$Period <- lab
  df
}, unname(ch4), period_labels))
# calculate mean concentration over each 20-month period
ch4_periods <- seq_along(ch4)
ch4_period_names <- names(ch4)
ch4_period_means <- unname(sapply(ch4, function(df) mean(df$Average_ppb)))
ch4_period_df <- data.frame(ch4_periods, ch4_period_names, ch4_period_means)
# N2O
n2o <- as.data.frame(gas.data[2])
names(n2o) <- gsub('n2o.', "", names(n2o))
# make n2o data frames (one per 20-month window)
n2o <- df_window_subset(n2o)
# add a Date column to each windowed data frame
n2o <- lapply(n2o, add_date_col)
# combine into one df, labeling each window with a readable Period string
period_labels <- c("Dec '03 - July '05", "Dec '05 - July '07", "Dec '07 - July '09",
                   "Dec '09 - July '11", "Dec '11 - July '13", "Dec '13 - July '15",
                   "Dec '15 - July '17", "Dec '17 - July '19", "Dec '19 - July '21")
n2o_df <- do.call(rbind, Map(function(df, lab) {
  df$Period <- lab
  df
}, unname(n2o), period_labels))
# calculate mean concentration over each 20-month period
n2o_periods <- seq_along(n2o)
n2o_period_names <- names(n2o)
n2o_period_means <- unname(sapply(n2o, function(df) mean(df$Average_ppb)))
n2o_period_df <- data.frame(n2o_periods, n2o_period_names, n2o_period_means)
# SF6
colnames(gas.data[[3]])[which(colnames(gas.data[[3]]) == "Average_ppt")] <- "Average_ppb"
gas.data[[3]]$Average_ppb <- gas.data[[3]]$Average_ppb / 1000
sf6 <- as.data.frame(gas.data[3])
names(sf6) <- gsub('sf6.', "", names(sf6))
# make sf6 data frames (one per 20-month window)
sf6 <- df_window_subset(sf6)
# add a Date column to each windowed data frame
sf6 <- lapply(sf6, add_date_col)
# combine into one df, labeling each window with a readable Period string
period_labels <- c("Dec '03 - July '05", "Dec '05 - July '07", "Dec '07 - July '09",
                   "Dec '09 - July '11", "Dec '11 - July '13", "Dec '13 - July '15",
                   "Dec '15 - July '17", "Dec '17 - July '19", "Dec '19 - July '21")
sf6_df <- do.call(rbind, Map(function(df, lab) {
  df$Period <- lab
  df
}, unname(sf6), period_labels))
# calculate mean concentration over each 20-month period
sf6_periods <- seq_along(sf6)
sf6_period_names <- names(sf6)
sf6_period_means <- unname(sapply(sf6, function(df) mean(df$Average_ppb)))
sf6_period_df <- data.frame(sf6_periods, sf6_period_names, sf6_period_means)
############################################################################
# make data frames with every month 12/03-07/22
ch4_nonperiod <- as.data.frame(gas.data[1])
names(ch4_nonperiod) <- gsub('ch4.', "", names(ch4_nonperiod))
n2o_nonperiod <- as.data.frame(gas.data[2])
names(n2o_nonperiod) <- gsub('n2o.', "", names(n2o_nonperiod))
sf6_nonperiod <- as.data.frame(gas.data[3])
names(sf6_nonperiod) <- gsub('sf6.', "", names(sf6_nonperiod))
We first developed boxplots to explore the overall distribution of the gas data for each 20-month period that we defined earlier.
Code
theme_bluewhite <- function(base_size = 11, base_family = "") {
theme_bw() %+replace%
theme(
panel.grid.major = element_line(color = "white"),
panel.background = element_rect(fill = "lightblue"),
panel.border = element_rect(color = "lightblue", fill = NA),
axis.line = element_line(color = "lightblue"),
axis.ticks = element_line(color = "lightblue"),
axis.text = element_text(color = "steelblue")
)
}
a <- ggplot(ch4_df) +
geom_boxplot(aes(x = as.factor(Period), y = Average_ppb)) +
labs(title = "CH4") + xlab("20 Month Period") + ylab("PPB") +
theme_bluewhite() +
theme(
axis.text.x = element_text(
angle = 45,
hjust = 0.5,
vjust = .5
))
b <- ggplot(n2o_df) +
geom_boxplot(aes(x = as.factor(Period), y = Average_ppb)) +
labs(title = "N2O") + xlab("20 Month Period") + ylab("PPB") +
theme_bluewhite() +
theme(
axis.text.x = element_text(
angle = 45,
hjust = 0.5,
vjust = .5
))
c <- ggplot(sf6_df) +
geom_boxplot(aes(x = as.factor(Period), y = Average_ppb)) +
labs(title = "SF6") + xlab("20 Month Period") + ylab("PPB") +
theme_bluewhite() +
theme(
axis.text.x = element_text(
angle = 45,
hjust = 0.5,
vjust = .5
))
plot_grid(a, b, c)
Figure 3.1: Boxplots displaying the distribution of mole fraction data for the indicated 20-month windows for each of methane (top left), nitrous oxide (top right), and sulfur hexafluoride (bottom left). PPB = parts per billion.
We noticed a generally linear increase in the central tendency of the distributions across successive 20-month periods. We were therefore interested in exploring global greenhouse gas concentration trends with a linear model. In particular, we wanted to understand how COVID-19-induced changes in atmospheric gas concentrations might perturb these linear increases. Linear models were thus fit with and without the COVID-19 period to see how concentration predictions were affected by this short-term period of decreased human activity.
Code
x <- ggplot() +
geom_point(data = ch4_nonperiod[12:218,], aes(x = Year_Day_decimal, y = Average_ppb))+
geom_smooth(data = ch4_df, aes(x = Year_Day_decimal, y = Average_ppb, col = "With Covid Period"), method = "lm",
fullrange = TRUE) +
geom_smooth(data = ch4_df[1:160, ], aes(x = Year_Day_decimal, y = Average_ppb, col = "Without Covid Period"),
method = "lm", fullrange=TRUE) + xlim(c(2003,2050)) +
xlab("Year") + ylab("PPB") + ggtitle("CH4") +
scale_color_manual(name='Linear Trend', breaks=c('With Covid Period', 'Without Covid Period'),
values=c('With Covid Period'='blue', 'Without Covid Period'='red'))
y <- ggplot() +
geom_point(data = n2o_nonperiod[12:218,], aes(x = Year_Day_decimal, y = Average_ppb))+
geom_smooth(data = n2o_df, aes(x = Year_Day_decimal, y = Average_ppb, col = "With Covid Period"), method = "lm",
fullrange = TRUE) +
geom_smooth(data = n2o_df[1:160, ], aes(x = Year_Day_decimal, y = Average_ppb, col = "Without Covid Period"),
method = "lm", fullrange=TRUE) + xlim(c(2003,2050)) +
xlab("Year") + ylab("PPB") + ggtitle("N2O") +
scale_color_manual(name='Linear Trend', breaks=c('With Covid Period', 'Without Covid Period'),
values=c('With Covid Period'='blue', 'Without Covid Period'='red'))
z <- ggplot() +
geom_point(data = sf6_nonperiod[12:218,], aes(x = Year_Day_decimal, y = Average_ppb))+
geom_smooth(data = sf6_df, aes(x = Year_Day_decimal, y = Average_ppb, col = "With Covid Period"), method = "lm",
fullrange = TRUE) +
geom_smooth(data = sf6_df[1:160, ], aes(x = Year_Day_decimal, y = Average_ppb, col = "Without Covid Period"),
method = "lm", fullrange=TRUE) + xlim(c(2003,2050)) +
xlab("Year") + ylab("PPB") + ggtitle("SF6") +
scale_color_manual(name='Linear Trend', breaks=c('With Covid Period', 'Without Covid Period'),
values=c('With Covid Period'='blue', 'Without Covid Period'='red'))
plot_grid(x,y,z)
Figure 3.2: Linear models fit across monthly average gas concentration data with the defined COVID-19 period of December 2019 - July 2021 (blue line) and without the COVID-19 period (red line), extrapolated to 2050 for each of the three indicated greenhouse gases. Black dots indicate actual concentration data. Shaded areas around each line indicate the 95% confidence interval for the true regression line. PPB = parts per billion.
Below are the regression models used to calculate the 2050 PPB predictions with and without the COVID period. The results of the linear models were the opposite of what we expected: removing the COVID period increased the predicted 2050 PPB across the board for our three greenhouse gases of interest.
Code
# 2050 predictions
options(scipen=999)
#ch4 with covid
lm <- lm(Average_ppb ~ Year_Day_decimal, data = ch4_df)
#ch4 without covid
lm_wo <- lm(Average_ppb ~ Year_Day_decimal, data = ch4_df[1:160, ])
#n2o with covid
lm1 <- lm(Average_ppb ~ Year_Day_decimal, data = n2o_df)
# n2o without covid
lm1_wo <- lm(Average_ppb ~ Year_Day_decimal, data = n2o_df[1:160, ])
# sf6 with covid
lm2 <- lm(Average_ppb ~ Year_Day_decimal, data = sf6_df)
# sf6 without covid
lm2_wo <- lm(Average_ppb ~ Year_Day_decimal, data = sf6_df[1:160, ])
CH4 prediction with covid period: 2072682.147 ppb.
CH4 prediction without covid period: 2049570.079 ppb.
N2O prediction with covid period: 360.798 ppb.
N2O prediction without covid period: 360.012 ppb.
SF6 prediction with covid period: 0.019 ppb.
SF6 prediction without covid period: 0.019 ppb.
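For transparency, here is a minimal sketch of how such a 2050 extrapolation can be computed. The real `ch4_df` is not reproduced here, so this uses synthetic data with the same column names, and the `lm(y ~ x, data = …)` model form is assumed:

```r
# Synthetic stand-in for a monthly gas-concentration data frame
set.seed(42)
year <- seq(2003, 2021.5, by = 1 / 12)
toy_df <- data.frame(
  Year_Day_decimal = year,
  Average_ppb      = 1750 + 5 * (year - 2003) + rnorm(length(year), sd = 2)
)

# Fit a linear trend of concentration on decimal year
fit <- lm(Average_ppb ~ Year_Day_decimal, data = toy_df)

# Extrapolate the fitted trend line to the year 2050
predict(fit, newdata = data.frame(Year_Day_decimal = 2050))
```

Note that fitting with `df$y ~ df$x` makes `predict()` silently ignore `newdata`, so the `data =` form is safer when the goal is extrapolation.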
3.2 NASA Gas Concentration Time Series and Moving Average Analysis
3.2.1 Gas concentration time series plots
So, we’ve identified some predictive models of the gas data, but it would be nice to get a clearer picture of the individual trends in each of the three gases. Namely, what kinds of fluctuation do we see within and between the gas datasets? To get a general sense of trends in the gas data, we can look at simple time series plots.
Code
df_window_subset_full <- function(df, init.year.wind = 2003, wind.size = 1, final.init.year = 2021) {
moving.dfsub.list <- list()
while (init.year.wind <= final.init.year) {
end.year.wind <- init.year.wind + wind.size
# Filter for only years in the current window
moving.df <- df %>%
filter(Year >= init.year.wind & Year < end.year.wind)
moving.dfsub.list[[paste(init.year.wind)]] <- as.data.frame(moving.df)
init.year.wind <- init.year.wind + wind.size
}
return(moving.dfsub.list)
}
wrangle_gasdata_full <- function() {
# Use replace_with_na() from the naniar package to replace values of -9.99 with NA
# See https://cran.r-project.org/web/packages/naniar/vignettes/replace-with-na.html
if (!require(naniar)) {
install.packages("naniar", quietly = TRUE)
}
gas.data.na <- lapply(gas.data, function(df) df %>%
replace_with_na(replace = list(Uncertainty_avg = -9.99,
Uncertainty_trend = -9.99)))
gas.data.date <- lapply(gas.data.na, add_date_col)
gas.data.windowed.list <- lapply(gas.data.date, df_window_subset_full)
return(gas.data.windowed.list)
}
#introduce dataset from gas_temp_wrangler.r file function
gases <- wrangle_gasdata(gas.data) #Windowed (change to df_window_subset function)
gases_full <- wrangle_gasdata_full() #Contiguous (change to df_window_subset_full function)
#head(gases)
# View(gases$sf6)
#View(gases_full)
# View(gases_full$sf6$`2021`)
#-------------------Dataframe Initialization----------------------
##########convert data to subsetted Dataframes########
#sf6.df <- as.data.frame(gases$sf6)<- for horizontal data
big.sf6 <- bind_rows(gases$sf6)
#ch4.df <- as.data.frame(gases$ch4)<- for horizontal data
big.ch4 <- bind_rows(gases$ch4)
#n2o.df <- as.data.frame(gases$n2o)<- for horizontal data
big.n2o <- bind_rows(gases$n2o)
#########convert data to contiguous Dataframes########
#sf6_full.df <- as.data.frame(gases_full$sf6) <- for horizontal data
big.sf6_full <- bind_rows(gases_full$sf6)
#ch4.df_full <- as.data.frame(gases_full$ch4) <- for horizontal data
big.ch4_full <- bind_rows(gases_full$ch4)
#n2o.df_full <- as.data.frame(gases_full$n2o) <- for horizontal data
big.n2o_full <- bind_rows(gases_full$n2o)
################Combine the Dataframes#########################
big.sf6$gas <- "sf6"
big.ch4$gas <- "ch4"
big.n2o$gas <- "n2o"
big.gas <- rbind(big.ch4, big.n2o, big.sf6)
#Joined Time Series (contiguous)
big.sf6_full$gas <- "sf6"
big.ch4_full$gas <- "ch4"
big.n2o_full$gas <- "n2o"
big.gas_full <- rbind(big.ch4_full, big.n2o_full, big.sf6_full)
Code
#----------------------------Time Series ---------------------------------
#############Generate plots for each gas####################
#sf6 time series
tmp <- data.frame(rectangle_x = c(big.sf6_full[big.sf6_full$Date=="2019-12-01", "Date"],
big.sf6_full[big.sf6_full$Date=="2019-12-01", "Date"],
max(big.sf6_full$Date),
max(big.sf6_full$Date)),
rectangle_y = c(-Inf,
Inf,
Inf,
-Inf))
sf6.plot <- ggplot(big.sf6_full, aes(x = Date, y = Average_ppb)) +
geom_area(fill="#69b3a2", alpha=0.5) +
geom_line(color = "#69b3a2") +
geom_polygon(data = tmp, aes(x = rectangle_x, y = rectangle_y),
fill = "red",
alpha = .3,
linetype = "dashed") +
scale_x_date(date_breaks = "12 months", date_labels = "%Y %b %d") +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
geom_vline(linetype = 4, xintercept = as.numeric(big.sf6_full[big.sf6_full$Date=="2019-12-01", "Date"]), color = "red") +
ggtitle("SF6 Concentration From 2003 - 2021") +
annotate("text",
x = big.sf6_full[big.sf6_full$Date=="2019-08-01", "Date"],
y = .006,
label = "First Reported Case of COVID-19",
angle = 90,
size = 4)
#sf6.plot
sf6_ly <- ggplotly(sf6.plot)
#ch4 time series
tmp <- data.frame(rectangle_x = c(big.ch4_full[big.ch4_full$Date=="2019-12-01", "Date"],
big.ch4_full[big.ch4_full$Date=="2019-12-01", "Date"],
max(big.ch4_full$Date),
max(big.ch4_full$Date)),
rectangle_y = c(-Inf,
Inf,
Inf,
-Inf))
ch4.plot <- ggplot(big.ch4_full, aes(x = Date, y = Average_ppb)) +
geom_area(fill="orange", alpha=0.5) +
geom_line(color = "orange") +
geom_polygon(data = tmp, aes(x = rectangle_x, y = rectangle_y),
fill = "red",
alpha = .3,
linetype = "dashed") +
scale_x_date(date_breaks = "12 months", date_labels = "%Y %b %d") +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
geom_vline(linetype = 4, xintercept = as.numeric(big.ch4_full[big.ch4_full$Date=="2019-12-01", "Date"]), color = "red") +
ggtitle("CH4 Concentration From 2003 - 2021") +
annotate("text",
x = big.ch4_full[big.ch4_full$Date=="2019-09-01", "Date"],
y = 1000000,
label = "First Reported Case of COVID-19",
angle = 90,
size = 4)
#ch4.plot
ch4_ly <- ggplotly(ch4.plot)
#n2o time series
tmp <- data.frame(rectangle_x = c(big.n2o_full[big.n2o_full$Date=="2019-12-01", "Date"],
big.n2o_full[big.n2o_full$Date=="2019-12-01", "Date"],
max(big.n2o_full$Date),
max(big.n2o_full$Date)),
rectangle_y = c(-Inf,
Inf,
Inf,
-Inf))
n2o.plot <- ggplot(big.n2o_full, aes(x = Date, y = Average_ppb)) +
geom_area(fill="blue", alpha=0.5) +
geom_line(color = "blue") +
geom_polygon(data = tmp, aes(x = rectangle_x, y = rectangle_y),
fill = "red",
alpha = .3,
linetype = "dashed") +
scale_x_date(date_breaks = "12 months", date_labels = "%Y %b %d") +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
geom_vline(linetype = 4, xintercept = as.numeric(big.n2o_full[big.n2o_full$Date=="2019-12-01", "Date"]), color = "red") +
ggtitle("N2O Concentration From 2003 - 2021") +
annotate("text",
x = big.n2o_full[big.n2o_full$Date=="2019-08-01", "Date"],
y = 150,
label = "First Reported Case of COVID-19",
angle = 90,
size = 4)
#n2o.plot
n2o_ly <- ggplotly(n2o.plot)
#ggplots
plot_grid(sf6.plot, ch4.plot, n2o.plot)
Figure 3.3: Time series plots for atmospheric SF6 (top left), CH4 (top right), and N2O (bottom left). The shaded red rectangle indicates the window defined in this analysis as the COVID-19 window.
Code
#plotlys
Code
sf6_ly
Figure 3.4: Interactive ggplotly time series plots of the indicated gas atmospheric concentration from 2003 to 2022.
Code
ch4_ly
Code
n2o_ly
At a glance, the time series don’t suggest much of a trend difference before versus during the pandemic. Note that the y-axis scaling for each gas is different, so what matters more here is the rate of change across time. SF6 seemed to show the fastest rate of change, and all three plots show a general upward trend in gas concentrations.
Let’s break up the series into COVID-length (~20 months) windows:
Code
#########################Time Series by Covid Length Blocks#####################
#SF6-------------
block1 <- as.data.frame(gases$sf6$'2003-2005')
block2 <- as.data.frame(gases$sf6$'2005-2007')
block3 <- as.data.frame(gases$sf6$'2007-2009')
block4 <- as.data.frame(gases$sf6$'2009-2011')
block5 <- as.data.frame(gases$sf6$'2011-2013')
block6 <- as.data.frame(gases$sf6$'2013-2015')
block7 <- as.data.frame(gases$sf6$'2015-2017')
block8 <- as.data.frame(gases$sf6$'2017-2019')
block9 <- as.data.frame(gases$sf6$'2019-2021')
time.blocks <- list(block1,
block2,
block3,
block4,
block5,
block6,
block7,
block8,
block9)
# str(time.blocks)
block_list <- list()
for (i in time.blocks){
new_element <- ggplot(i, aes(x = Date, y = Average_ppb)) +
geom_line(color = "blue") +
scale_x_date(date_breaks = "6 months", date_labels = "%Y %b %d") +
theme(axis.text.x=element_text(angle=0, hjust=1))
block_list[[length(block_list) + 1]] <- new_element
}
# block_list
block.names <- list('SF6 Concentrations for 2003-2005',
'SF6 Concentrations for 2005-2007',
'SF6 Concentrations for 2007-2009',
'SF6 Concentrations for 2009-2011',
'SF6 Concentrations for 2011-2013',
'SF6 Concentrations for 2013-2015',
'SF6 Concentrations for 2015-2017',
'SF6 Concentrations for 2017-2019',
'SF6 Concentrations for 2019-2021')
ggarrange(plotlist = block_list,
ncol = 3, nrow = 3,
labels = block.names,
hjust = -.2555,
vjust = 1.5)
Figure 3.5: Windowed time series plots of the indicated gas concentration over 20-month periods; the timeframe of each period is given in the subfigure title. Each period runs from December of the start year to July of the end year. For example, the time series plot for the period 2003 - 2005 (top left in all figures) shows monthly average gas concentrations from December 2003 to July 2005, mimicking the defined COVID-19 window.
Code
#CH4--------------
block1 <- as.data.frame(gases$ch4$'2003-2005')
block2 <- as.data.frame(gases$ch4$'2005-2007')
block3 <- as.data.frame(gases$ch4$'2007-2009')
block4 <- as.data.frame(gases$ch4$'2009-2011')
block5 <- as.data.frame(gases$ch4$'2011-2013')
block6 <- as.data.frame(gases$ch4$'2013-2015')
block7 <- as.data.frame(gases$ch4$'2015-2017')
block8 <- as.data.frame(gases$ch4$'2017-2019')
block9 <- as.data.frame(gases$ch4$'2019-2021')
time.blocks <- list(block1,
block2,
block3,
block4,
block5,
block6,
block7,
block8,
block9)
#str(time.blocks)
block_list <- list()
for (i in time.blocks){
new_element <- ggplot(i, aes(x = Date, y = Average_ppb)) +
geom_line(color = "blue") +
scale_x_date(date_breaks = "1 months", date_labels = "%b") +
theme(axis.text.x=element_text(angle=0, hjust=1))
block_list[[length(block_list) + 1]] <- new_element
}
block.names <- list('CH4 Concentrations for 2003-2005',
'CH4 Concentrations for 2005-2007',
'CH4 Concentrations for 2007-2009',
'CH4 Concentrations for 2009-2011',
'CH4 Concentrations for 2011-2013',
'CH4 Concentrations for 2013-2015',
'CH4 Concentrations for 2015-2017',
'CH4 Concentrations for 2017-2019',
'CH4 Concentrations for 2019-2021')
ggarrange(plotlist = block_list,
ncol = 3, nrow = 3,
labels = block.names,
hjust = -.2555,
vjust = 1.5)
Figure 3.6: Windowed time series plots of the indicated gas concentration over 20-month periods; the timeframe of each period is given in the subfigure title. Each period runs from December of the start year to July of the end year. For example, the time series plot for the period 2003 - 2005 (top left in all figures) shows monthly average gas concentrations from December 2003 to July 2005, mimicking the defined COVID-19 window.
Code
#N2O----------
block1 <- as.data.frame(gases$n2o$'2003-2005')
block2 <- as.data.frame(gases$n2o$'2005-2007')
block3 <- as.data.frame(gases$n2o$'2007-2009')
block4 <- as.data.frame(gases$n2o$'2009-2011')
block5 <- as.data.frame(gases$n2o$'2011-2013')
block6 <- as.data.frame(gases$n2o$'2013-2015')
block7 <- as.data.frame(gases$n2o$'2015-2017')
block8 <- as.data.frame(gases$n2o$'2017-2019')
block9 <- as.data.frame(gases$n2o$'2019-2021')
time.blocks <- list(block1,
block2,
block3,
block4,
block5,
block6,
block7,
block8,
block9)
#str(time.blocks)
block_list <- list()
for (i in time.blocks){
new_element <- ggplot(i, aes(x = Date, y = Average_ppb)) +
geom_line(color = "blue") +
scale_x_date(date_breaks = "1 months", date_labels = "%b") +
theme(axis.text.x=element_text(angle=0, hjust=1))
block_list[[length(block_list) + 1]] <- new_element
}
block.names <- list('N2O Concentrations for 2003-2005',
'N2O Concentrations for 2005-2007',
'N2O Concentrations for 2007-2009',
'N2O Concentrations for 2009-2011',
'N2O Concentrations for 2011-2013',
'N2O Concentrations for 2013-2015',
'N2O Concentrations for 2015-2017',
'N2O Concentrations for 2017-2019',
'N2O Concentrations for 2019-2021')
ggarrange(plotlist = block_list,
ncol = 3, nrow = 3,
labels = block.names,
hjust = -.2555,
vjust = 1.5)
Figure 3.7: Windowed time series plots of the indicated gas concentration over 20-month periods; the timeframe of each period is given in the subfigure title. Each period runs from December of the start year to July of the end year. For example, the time series plot for the period 2003 - 2005 (top left in all figures) shows monthly average gas concentrations from December 2003 to July 2005, mimicking the defined COVID-19 window.
When broken up, it’s a little easier to compare the COVID window against earlier windows. SF6 looks a little smoother within the COVID window. CH4 follows similar patterns within each window, but there seems to be a more dramatic dip in March 2021, late in the COVID window. N2O behaves similarly to SF6, with a smoother plot and a less dramatic spike during the COVID window.
3.2.2 Gas atmospheric concentration heatmaps
Sometimes it can be difficult to show very small changes along a large time frame using just a line. In this case, other, more visually nuanced plots can come in handy. Let’s check out heat maps for the gas concentration data:
Code
#---------------Heat Map(s)-----------------------------
######################### SF6 #########################
sf6_heat.plot <-ggplot(big.sf6_full,aes(x = Year, y= Month, fill=Average_ppb))+
geom_tile(color= "white",size=0.1) +
scale_fill_viridis(name="Average_ppb",option ="C", discrete = FALSE)
# Single y scale: reverse so January is at the top, with one break per month
sf6_heat.plot <- sf6_heat.plot + scale_y_continuous(trans = "reverse", breaks = 1:12)
sf6_heat.plot <-sf6_heat.plot + theme_minimal(base_size = 8)
sf6_heat.plot <-sf6_heat.plot + labs(title= paste("SF6 Concentration"), x="Year", y="Month")
sf6_heat.plot <-sf6_heat.plot + theme(legend.position = "bottom")+
theme(plot.title=element_text(size = 14))+
theme(axis.text.y=element_text(size=10)) +
theme(strip.background = element_rect(colour="white"))+
theme(plot.title=element_text(hjust=0))+
theme(axis.ticks=element_blank())+
theme(axis.text=element_text(size=10))+
theme(legend.title=element_text(size=10))+
theme(legend.text=element_text(size=8)) +
theme(legend.key.width = unit(1, 'cm'))
######################### CH4 #########################
ch4_heat.plot <-ggplot(big.ch4_full,aes(x = Year, y= Month, fill=Average_ppb))+
geom_tile(color= "white",size=0.1) +
scale_fill_viridis(name="Average_ppb",option ="C", discrete = FALSE)
# Single y scale: reverse so January is at the top, with one break per month
ch4_heat.plot <- ch4_heat.plot + scale_y_continuous(trans = "reverse", breaks = 1:12)
ch4_heat.plot <-ch4_heat.plot + theme_minimal(base_size = 8)
ch4_heat.plot <-ch4_heat.plot + labs(title= paste("CH4 Concentration"), x="Year", y="Month")
ch4_heat.plot <-ch4_heat.plot + theme(legend.position = "bottom")+
theme(plot.title=element_text(size = 14))+
theme(axis.text.y=element_text(size=10)) +
theme(strip.background = element_rect(colour="white"))+
theme(plot.title=element_text(hjust=0))+
theme(axis.ticks=element_blank())+
theme(axis.text=element_text(size=10))+
theme(legend.title=element_text(size=10))+
theme(legend.text=element_text(size=8)) +
theme(legend.key.width = unit(1, 'cm'))
######################### N2o #########################
n2o_heat.plot <-ggplot(big.n2o_full,aes(x = Year, y= Month, fill=Average_ppb))+
geom_tile(color= "white",size=0.1) +
scale_fill_viridis(name="Average_ppb",option ="C", discrete = FALSE)
# Single y scale: reverse so January is at the top, with one break per month
n2o_heat.plot <- n2o_heat.plot + scale_y_continuous(trans = "reverse", breaks = 1:12)
n2o_heat.plot <-n2o_heat.plot + theme_minimal(base_size = 8)
n2o_heat.plot <-n2o_heat.plot + labs(title= paste("N2O Concentration"), x="Year", y="Month")
n2o_heat.plot <-n2o_heat.plot + theme(legend.position = "bottom")+
theme(plot.title=element_text(size = 14))+
theme(axis.text.y=element_text(size=10)) +
theme(strip.background = element_rect(colour="white"))+
theme(plot.title=element_text(hjust=0))+
theme(axis.ticks=element_blank())+
theme(axis.text=element_text(size=10))+
theme(legend.title=element_text(size=10))+
theme(legend.text=element_text(size=8)) +
theme(legend.key.width = unit(1, 'cm'))
######################## Moment of Truth ###############
ggplotly(sf6_heat.plot)
Figure 3.8: Heatmap of monthly gas concentrations of the indicated gas for years 2003 - 2021.
Code
ggplotly(ch4_heat.plot)
Code
ggplotly(n2o_heat.plot)
Code
## Big Boy---------
# lapply(big.gas_full$gas, function(x) {
# gg <- ggplot(filter(big.gas_full, gas==x),
# aes(x=Year, y=Month, fill=Average_ppb, frame=gas))
# gg <- gg + geom_tile(color="white", linewidth=0.1)
# gg <- gg + scale_x_discrete(expand=c(0,0))
# gg <- gg + scale_y_discrete(expand=c(0,0))
# gg <- gg + scale_fill_viridis(name="")
# gg <- gg + coord_equal()
# gg <- gg + labs(x="Year", y="Month",
# title=sprintf("%s", x))
# gg <- gg + theme_tufte()
# gg <- gg + theme(axis.ticks=element_blank())
# gg <- gg + theme(axis.text=element_text(size=5))
# gg <- gg + theme(panel.border=element_blank())
# gg <- gg + theme(plot.title=element_text(hjust=0, size=6))
# gg <- gg + theme(panel.spacing.x=unit(0.5, "cm"))
# gg <- gg + theme(panel.spacing.y=unit(0.5, "cm"))
# gg <- gg + theme(legend.title=element_text(size=6))
# gg <- gg + theme(legend.title.align=1)
# gg <- gg + theme(legend.text=element_text(size=6))
# gg <- gg + theme(legend.position="bottom")
# gg <- gg + theme(legend.key.size=unit(0.2, "cm"))
# gg <- gg + theme(legend.key.width=unit(1, "cm"))
# gg
# }) -> cclist
#
# cclist[["ncol"]] <- 1
#
# do.call(grid.arrange, cclist)
combined_heat.plot <- plot_grid(sf6_heat.plot,
ch4_heat.plot,
n2o_heat.plot,
labels = c("A", "B", "C"),
ncol = 1,
nrow = 3)
combined_heat.plot
The heat maps also add some dimensionality to our understanding of the data. Besides the upward trend in gas concentration across years, we can see that the later months of each year tend to have higher levels than the earlier months of the same year. Comparing windows, CH4 seems to show a somewhat more dramatic change in concentration during the COVID window than in previous years/windows.
3.2.3 Moving average plots
A moving average is computed by averaging over a sliding window of the data: with each step, the window drops its oldest element and takes in the next one. This can be useful for ascertaining shifts in trend and for smoothing potentially volatile data sets.
There are several different types of moving averages, but for the next few plots we define only two:
The first is the simple moving average (SMA), in which a defined number of entries are summed and divided by the entry count. The window can be positioned in different ways depending on what’s needed; in this case, each x-value is treated as the center of its window.
The second is the double exponential moving average (DEMA), which is more sensitive to recent changes in the data. It builds on the exponential moving average (EMA), a weighted average that discounts older observations: DEMA = 2 × EMA − EMA(EMA), i.e., double the EMA minus the EMA of that EMA, which reduces the lag of ordinary exponential smoothing and makes it a highly responsive smoothing average.
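These two averages can be sketched in a few lines of base R on toy data (the `TTR` package, which backs `geom_ma()`, provides production implementations as `SMA()` and `DEMA()`):

```r
# Centered simple moving average over n points (NA where the window is incomplete)
sma <- function(x, n) as.numeric(stats::filter(x, rep(1 / n, n), sides = 2))

# Exponential moving average with smoothing factor alpha = 2 / (n + 1)
ema <- function(x, n) {
  alpha <- 2 / (n + 1)
  out <- numeric(length(x))
  out[1] <- x[1]
  for (i in 2:length(x)) out[i] <- alpha * x[i] + (1 - alpha) * out[i - 1]
  out
}

# DEMA = 2 * EMA(x) - EMA(EMA(x)): subtracting the re-smoothed series
# reduces the lag of ordinary exponential smoothing
dema <- function(x, n) 2 * ema(x, n) - ema(ema(x, n), n)

x <- 1:10
sma(x, 3)   # centered 3-point means, NA at both ends
dema(x, 3)  # hugs a trending series more tightly than a plain EMA
```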
Code
#---------------------Moving Average-------------------
##################### sf6 rate of change ###############
sf6_full.sub_list <- c(big.sf6_full$Average_ppb[-1],last(big.sf6_full$Average_ppb))
big.sf6_full$avg_change <- sf6_full.sub_list - big.sf6_full$Average_ppb
sf6.rate_of_change <- big.sf6_full %>%
ggplot(aes(x = Date, y = avg_change)) +
#geom_line() +
geom_ma(color = "blue", ma_fun = SMA, n = 20, linetype = "solid") +# Plot windowed SMA
geom_ma(ma_fun = DEMA, n = 20, color = "red", linetype = "longdash") + #Plot windowed DEMA+
scale_x_date(date_breaks = "20 months", date_labels = "%Y %b %d") +
coord_x_date(xlim = c("2003-12-01", "2021-07-01")) + # Zoom in
labs(x = "Year", y = "Average Change (ppb)",
title = "Average Monthly SF6 Changes From 2003 - 2021",
subtitle = "Blue = SMA, Red = DEMA, n = 20") +
theme_minimal(base_line_size = .05) +
theme(text = element_text(size = 16),
axis.text.x=element_text(angle=60, hjust=1))
#sf6.rate_of_change
##################### ch4 rate of change ###############
ch4_full.sub_list <- c(big.ch4_full$Average_ppb[-1],last(big.ch4_full$Average_ppb))
big.ch4_full$avg_change <- ch4_full.sub_list - big.ch4_full$Average_ppb
ch4.rate_of_change <- big.ch4_full %>%
ggplot(aes(x = Date, y = avg_change)) +
#geom_line() +
geom_ma(color = "blue", ma_fun = SMA, n = 20, linetype = "solid") +# Plot windowed SMA
geom_ma(ma_fun = DEMA, n = 20, color = "red", linetype = "longdash") + #Plot windowed DEMA
scale_x_date(date_breaks = "20 months", date_labels = "%Y %b %d") +
coord_x_date(xlim = c("2003-12-01", "2021-07-01")) + # Zoom in
labs(x = "Year", y = "Average Change (ppb)",
title = "Average Monthly CH4 Changes From 2003 - 2021",
subtitle = "Blue = SMA, Red = DEMA, n = 20") +
theme_minimal(base_line_size = .05) +
theme(text = element_text(size = 16),
axis.text.x=element_text(angle=60, hjust=1))
#ch4.rate_of_change
##################### N2O rate of change ###############
n2o_full.sub_list <- c(big.n2o_full$Average_ppb[-1],last(big.n2o_full$Average_ppb))
big.n2o_full$avg_change <- n2o_full.sub_list - big.n2o_full$Average_ppb
n2o.rate_of_change <- big.n2o_full %>%
ggplot(aes(x = Date, y = avg_change)) +
#geom_line() +
geom_ma(color = "blue", ma_fun = SMA, n = 20, linetype = "solid") +# Plot windowed SMA
geom_ma(ma_fun = DEMA, n = 20, color = "red", linetype = "longdash") + #Plot windowed DEMA
scale_x_date(date_breaks = "20 months", date_labels = "%Y %b %d") +
coord_x_date(xlim = c("2003-12-01", "2021-07-01")) + # Zoom in
labs(x = "Year", y = "Average Change (ppb)",
title = "Average Monthly N2O Changes From 2003 - 2021",
subtitle = "Blue = SMA, Red = DEMA, n = 20") +
theme_minimal(base_line_size = .05) +
theme(text = element_text(size = 16),
axis.text.x=element_text(angle=60, hjust=1))
#n2o.rate_of_change
plot_grid(sf6.rate_of_change, ch4.rate_of_change, n2o.rate_of_change)
Figure 3.9: Moving averages of the month-to-month change in gas atmospheric concentration from 2003 - 2021 using the simple moving average (SMA, blue) and double exponential moving average (DEMA, red) approaches.
So, we can see that each of the gases has its own fluctuations in rate of change over time. SF6 shows a bit more of an upward trend marked by a few dips, whereas CH4 and N2O tend to bounce around a central value. There weren’t any glaring differences within the COVID window, but SF6 and N2O had somewhat irregular dips around the time of the first reported COVID case.
3.3 NASA Temp Data: Clustering and Correlation
Due to limitations of time and scope, only the NASA temperature data provided by the Global Historical Climatology Network (GHCN) was analyzed. However, to ensure that results drawn from the GHCN dataset would be reproducible, we first looked at the correlation between the AIRS datasets (v6 and v7) and the GHCN v4 data:
Code
temp.combin <- lapply(temp.data.converted.long, select_fun) %>% purrr::reduce(left_join, by = c("Year", "Month"))
colnames(temp.combin) <- c("Year", "Month", names(temp.data))
temp.combin %>%
select_if(is.numeric) %>%
select(!Year) %>%
cor(use = "pairwise.complete.obs") %>%
corrplot(method = "shade",
type = "lower",
diag = FALSE,
addCoef.col = "white",
tl.col = "black",
order = "hclust",
hclust.method = "complete")
Figure 3.10: Pearson’s linear correlation between temperature anomaly data in the AIRS v6, AIRS v7, and GHCN v4 datasets.
Since the data are highly correlated, it is reasonable to assume that the majority of trends observed for the GHCN data are recapitulated in the AIRS datasets.
In this section, we wished to explore the similarity (or dissimilarity) of years based on their temperature anomalies. We hoped to be able to identify strongly clustered groups, with which we could compare the characteristics of each member year to understand why some years were closer in temperature than others.
3.4 Heatmap and Correlation Analysis
To begin, a heatmap by year was created:
Code
temp.data.monthly <- lapply(temp.data.converted, function(df) df %>% select(contains(c(month.abb))))
for (i in 1:length(temp.data.monthly)) {
rownames(temp.data.monthly[[i]]) <- temp.data.converted[[1]]$"Year"
}
# Use scale on the *columns* of temp.data.monthly since this data is in wide format
# I.e., for each month, we want to get a z-score of each year to determine level of extremeness
# Want z-score of each year by month, not each month by year
# Use t() to get the df such that each column is a year, and each entry is that year
# As a z-score for the month
t(apply(temp.data.monthly[[3]], 2, scale)) %>%
Heatmap(cluster_columns = TRUE,
cluster_rows = FALSE,
row_labels = month.abb,
column_labels = 2002:2022,
name = "Z-score")
Figure 3.11: Heatmap of temperature anomalies of years 2002 - 2021 from the NASA GHCN v4 dataset. Years were row-z-scored based on their monthly average temperature anomaly and hierarchically clustered via the complete-linkage algorithm.
Interestingly enough, 2020 seems to be one of the most extreme years in terms of temperature difference relative to the reference, whereas 2021 is more similar by Euclidean distance to the years 2015 to 2018 and is comparatively less extreme than 2020, as indicated by the z-scores.
Note that complete-linkage hierarchical clustering is an agglomerative (bottom-up) clustering algorithm. In other words, the data is clustered by first assuming that each individual data point is its own cluster. Clusters are then iteratively merged with the following algorithm:
- Compute the distance \(D\) between each possible pair of clusters \(X\) and \(Y\), defined as: \(D(X, Y) = \max\limits_{x \in X, y \in Y} d(x, y)\), where \(d(x, y)\) is typically the Euclidean distance between two members of each cluster. In other words, we are finding the farthest neighbors between two clusters
- Identify which two clusters have the smallest distance between them as defined above
- Merge the two clusters
The above process is repeated until all clusters have been merged such that only one cluster remains.
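Base R’s `hclust()` implements exactly this procedure when `method = "complete"`; here is a small sketch on made-up one-dimensional "years":

```r
# Two tight groups plus one outlier, standing in for yearly anomaly profiles
x <- c(1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 9.0)
names(x) <- paste0("yr", 1:7)

# dist() gives the pairwise Euclidean distances d(x, y); complete linkage
# merges the pair of clusters whose *farthest* members are closest together
hc <- hclust(dist(x), method = "complete")

# Cutting the tree into 3 clusters recovers the two tight groups and the outlier
cutree(hc, k = 3)
```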
Based on the unexpected result of 2020 being relatively extreme and not being closely similar by Euclidean distance to 2021, we decided to investigate further how years were related to each other by computing the pairwise Pearson’s correlation of monthly temperature anomalies between each year:
Code
# Transpose to get correlation of years against each other, not months against each other
# See https://www.datanovia.com/en/blog/how-to-create-an-interactive-correlation-matrix-heatmap-in-r/
# for interactive correlation plot
temp.data.monthly[["ghcnv4"]] %>%
t() %>%
cor(use = "pairwise.complete.obs") %>%
corrplot(order = "hclust",
hclust.method = "complete",
method = "shade",
addCoef.col = "white",
addrect = 3,
tl.col = "black",
number.cex = 2,
tl.cex = 2,
cl.cex = 2,
rect.lwd = 4)
Figure 3.12: Pairwise correlation map of NASA temperature anomaly data for the years 2002 to 2022. Data are ordered based on complete-linkage hierarchical clustering. Rectangles drawn in black on the correlation map show clusters identified via this algorithm. Values closer to 1 indicate positive correlation between monthly temperatures; values closer to -1 indicate a negative correlation.
So it appears that the years 2002, 2007, 2010, 2016, 2017, and 2020 are moderately to strongly correlated and (seemingly) cluster well as identified by complete-linkage agglomeration. Beyond that, however, the other identified clusters do not seem well resolved. This may indicate that there is significant fluctuation between years, which prevents identification of highly similar years. To investigate further, we employed PCA and k-means clustering to classify years into groups.
3.4.1 PCA, K-Means Clustering, and Nearest Neighbor Analysis
While we now have a better picture of the relative standing of each year with respect to the others, we seek to explore how these years resolve into distinct groups. Therefore, we employ PCA to retain as much of the information in each year as possible without needing to explore \(R^{12}\) space.
First, we use k-means clustering to obtain clusters of years. K-means clustering is an unsupervised algorithm which attempts to partition a dataset into k distinct groups. The algorithm achieves this via the following iterative method:
- Define an initial set of k-means \(m_1^{(0)}, m_2^{(0)}, ..., m_k^{(0)}\), often initialized randomly. These are our initial cluster centroids.
- Assign each observation to the cluster with the nearest mean/centroid by Euclidean distance
- Update the means of each cluster (i.e., re-calculate centroids): \(m_i^{(t + 1)} = \displaystyle \frac{1}{\lvert S_i^{(t)} \rvert} \sum_{x_j \in S_i^{(t)}} x_j\)
Where \(m_i^{(t)}\) represents the mean for cluster \(i\) in the \(t\)-th iteration, and \(S_i^{(t)}\) represents the \(i\)-th cluster in the \(t\)-th iteration. Steps 2 and 3 above are repeated until assignments no longer change - at this point, the algorithm has converged.
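The two alternating steps fit in a few lines of base R; this is an illustrative toy (one-dimensional data, deterministic initialization), not the `kmeans()` call used in the analysis below:

```r
set.seed(1)
# Two groups of 1-D observations centered at 0 and 5
x <- c(rnorm(10, mean = 0), rnorm(10, mean = 5))
k <- 2

centers <- c(min(x), max(x))  # step 1: initial means (deterministic here)
repeat {
  # step 2: assign each observation to the nearest centroid
  cl <- sapply(x, function(p) which.min((p - centers)^2))
  # step 3: recompute each centroid as the mean of its cluster
  new_centers <- sapply(1:k, function(i) mean(x[cl == i]))
  if (all(abs(new_centers - centers) < 1e-12)) break  # converged
  centers <- new_centers
}
sort(centers)  # close to the true group means of 0 and 5
```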
To find the optimal value of k, we produce an elbow plot to determine the point beyond which adding clusters no longer appreciably improves the fit. To do this, we use the "within sum-of-squares" (WSS) approach, in which we calculate the sum of squared differences between each observation and its cluster mean, aiming to capture the maximum amount of variance without overfitting:
\[\sum_{k=1}^{K} \sum_{i \in S_k} (X_i - \overline{X}_k)^2 \]
Where \(S_k\) is the set of all observations in the \(k\)-th cluster, \(X_i\) is the \(i\)-th observation of this cluster, and \(\overline{X}_k\) is the mean of the \(k\)-th cluster.
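The WSS curve that `fviz_nbclust()` plots below can also be computed directly with base R, since `kmeans()` reports the quantity above as `tot.withinss`. A minimal sketch on a toy matrix standing in for the monthly anomaly data:

```r
set.seed(1)
# Toy stand-in for the monthly temperature-anomaly matrix
toy <- matrix(rnorm(200), ncol = 4)
# Total within sum-of-squares for k = 1..8 clusters
wss <- sapply(1:8, function(k) kmeans(toy, centers = k, nstart = 20)$tot.withinss)
# WSS shrinks as k grows; the "elbow" marks diminishing returns
plot(1:8, wss, type = "b", xlab = "Number of clusters k", ylab = "Within SS")
```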
Code
# Keep only temperature anomaly data
# Remove 2022 since it has NA values, does not work with k-means
temp.ghcn <- temp.data.monthly[["ghcnv4"]] %>%
select(Jan:Dec) %>%
na.omit()
fviz_nbclust(temp.ghcn, kmeans, method = "wss")
Figure 3.13: Elbow plot of the k-means clustering analysis, showing the within sum-of-squares (WSS) as a function of the number of clusters k. An optimal choice of k occurs near the asymptote of the WSS curve.
Based on the above, \(k = 3\) was chosen for cluster analysis. Note that since 2022 contains missing values for November and December, 2022 was excluded from analysis as k-means requires observations of the same length. See the below code chunk for the k-means analysis.
Code
set.seed(1)
kmeans.temp <- kmeans(temp.ghcn, centers = 3, nstart = 20)
kmeans.temp
## K-means clustering with 3 clusters of sizes 4, 3, 13
##
## Cluster means:
## Jan Feb Mar Apr May Jun
## 1 0.13050000 0.1485000 0.18100000 0.12350000 0.10275 0.16725000
## 2 0.39700000 0.5663333 0.43833333 0.32733333 0.24900 0.14666667
## 3 -0.06038462 -0.0700000 -0.09376923 -0.09146154 -0.08900 -0.06769231
## Jul Aug Sep Oct Nov Dec
## 1 0.19300000 0.11800000 0.1505 0.27625000 0.17825000 0.32600000
## 2 0.19366667 0.20633333 0.1620 0.13633333 0.19500000 0.19166667
## 3 -0.07353846 -0.07761538 -0.0620 -0.07061538 -0.07507692 -0.06923077
##
## Clustering vector:
## 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
## 3 3 3 3 3 3 3 3 3 3 3 3 3 1 2 2
## 2018 2019 2020 2021
## 1 1 2 1
##
## Within cluster sum of squares by cluster:
## [1] 0.3334270 0.1938793 1.9750960
## (between_SS / total_SS = 68.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Code
kmeans.temp$cluster
## 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
## 3 3 3 3 3 3 3 3 3 3 3 3 3 1 2 2
## 2018 2019 2020 2021
## 1 1 2 1
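The components of the `kmeans` output above can be verified by hand: each row of `centers` is just the per-cluster column mean of the assigned observations, and the reported "(between_SS / total_SS)" ratio comes from the decomposition totss = betweenss + tot.withinss. A sketch on toy data (a stand-in for the anomaly matrix, not the GHCN data itself):

```r
set.seed(1)
# Toy stand-in: 20 "years" x 4 "months"
toy <- matrix(rnorm(80), ncol = 4,
              dimnames = list(2002:2021, month.abb[1:4]))
km <- kmeans(toy, centers = 3, nstart = 20)
# Recompute cluster means as per-cluster column means of assigned rows
centers.manual <- apply(toy, 2, function(col) tapply(col, km$cluster, mean))
# The percentage printed as "between_SS / total_SS"
ratio <- km$betweenss / km$totss
```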
Finally, to investigate how well our so-identified clusters resolve, we use PCA and color the plot based on the unsupervised tagging from our clustering algorithm:
Code
# Use scale. = TRUE to scale variables to have unit variance
temp.ghcn.clustered <- cbind(temp.ghcn, Group = kmeans.temp$cluster)
temp.ghcn %>%
prcomp(scale. = TRUE) %>%
autoplot(loadings = TRUE,
colour = temp.ghcn.clustered$Group,
loadings.colour = "blue",
loadings.label.colour = "magenta",
loadings.label.repel = TRUE) +
geom_text(vjust = -1,
label = rownames(temp.ghcn.clustered),
color = temp.ghcn.clustered$Group) +
theme(axis.line = element_line(size = 1),
axis.title = element_text(size = 14),
axis.text = element_text(size = 14))
Figure 3.14: PCA plot of years 2002-2021. Years are colored by the groups identified by k-means clustering with \(k = 3\). Arrows indicate the directions of the original variables projected onto \(R^2\).
As can be seen above, group 1 (black) and group 2 (red) are relatively well defined, while group 3 (green) is not. The PCA results are at least consistent with what was seen in the correlation plot, though the degree of variability in the data was unexpected: we had hypothesized that years would tend to cluster into consecutive runs, given that temperature and atmospheric greenhouse gas concentrations have risen steadily over the past several decades.
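How much of the original 12-dimensional variance survives this 2-D projection can be read directly from `prcomp`. A short sketch on a toy matrix shaped like the anomaly data (20 years x 12 months; illustrative values only):

```r
set.seed(1)
# Toy stand-in for the 20-year x 12-month anomaly matrix
toy <- matrix(rnorm(240), ncol = 12, dimnames = list(2002:2021, month.abb))
pc <- prcomp(toy, scale. = TRUE)
# Proportion of total variance carried by each principal component
var.explained <- pc$sdev^2 / sum(pc$sdev^2)
sum(var.explained[1:2])  # share of variance shown on the 2-D PCA plot
```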
We conclude this section of EDA with a nearest neighbor analysis, showing both a Euclidean distance matrix and interactive table for nearest neighbors:
Code
temp.ghcn %>%
get_dist() %>%
fviz_dist() %>%
ggplotly()
Code
# To extract nearest neighbors, see: https://stackoverflow.com/questions/41770948/
# Also see https://gis.stackexchange.com/questions/320802/
# And see documentation for get.knn - will return the k-nearest neighbors of a df
year.nn <- get.knn(temp.ghcn, k = 1)
# Indices of nearest neighbors are stored in nn.index
nearest.year <- lapply(year.nn$nn.index, function(i) rownames(temp.ghcn)[i])
nearest.year.df <- data.frame(Year = as.numeric(rownames(temp.ghcn)),
"Nearest Neighbor" = as.numeric(unname(unlist(nearest.year))),
Distance = year.nn$nn.dist[, 1],
check.names = FALSE) # To avoid replacing space w/ period
nearest.year.df <- cbind(nearest.year.df,
"Number of Years Apart" = abs(nearest.year.df$Year - nearest.year.df$"Nearest Neighbor"))
# Use datatable; produces sortable, interactive table when knitted
DT::datatable(nearest.year.df,
caption = htmltools::tags$caption(
style = "caption-side: bottom; text-align: center;",
"Table 1: ",
"Interactive table for NASA GISTEMP GHCN v4 data showing
each year, its nearest neighbor year, the Euclidean distance
between years, and the number of years separating each (year,
nearest neighbor) pair."))
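As a sanity check, the `get.knn()` result can be reproduced with base R alone: compute the full Euclidean distance matrix, mask the diagonal, and take each row's minimum. A sketch on toy data (same shape as the anomaly matrix, illustrative values only):

```r
set.seed(1)
# Toy stand-in for the 20-year x 12-month anomaly matrix
toy <- matrix(rnorm(240), ncol = 12, dimnames = list(2002:2021, month.abb))
d <- as.matrix(dist(toy))          # full pairwise Euclidean distance matrix
diag(d) <- Inf                     # exclude self-matches
nn.idx <- apply(d, 1, which.min)   # index of each row's nearest neighbor
nn.year <- rownames(toy)[nn.idx]
head(data.frame(Year = rownames(toy),
                "Nearest Neighbor" = nn.year,
                Distance = d[cbind(seq_len(nrow(toy)), nn.idx)],
                check.names = FALSE))
```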